Tutorial 1 - Introduction to Whisper
In this tutorial, you’ll explore the basics of OpenAI’s Whisper, a cutting-edge automatic speech recognition (ASR) model designed to convert audio into text. Whisper is renowned for its accuracy and ability to handle various languages, making it a powerful tool for projects that require speech-to-text functionality. Whether you're building an app for transcribing audio files or implementing real-time transcription in a web interface, Whisper simplifies the process by providing pre-trained models that are easy to integrate.
This tutorial will guide you through the initial setup of Whisper, covering everything from installing the necessary libraries to transcribing pre-recorded audio files. By building a solid understanding of how the Whisper ASR model works, you’ll lay the foundation for more complex applications, such as real-time transcription systems.
Helpful prior knowledge
Basic familiarity with Python and with running commands in a terminal will help you get the most out of this tutorial.
Learning Outcomes
By the end of this tutorial, you will be able to:
- Install and configure the necessary libraries and tools to work with Whisper.
- Understand how Whisper transcribes audio files and handles different languages.
- Implement a basic audio transcription using a pre-recorded file with Whisper.
- Gain the foundational skills needed to extend the transcription process into a real-time application.
Tutorial Steps
Total steps: 10
-
Step 1: Introduction to OpenAI Whisper ASR Model
In this step, we will explore the underlying principles of the OpenAI Whisper ASR (Automatic Speech Recognition) model, how it works, and why it is such a groundbreaking tool for speech-to-text applications. Understanding the architecture and design of Whisper will provide you with a solid foundation for implementing it in real-world projects, especially when moving into more practical tasks like transcription.
Whisper is built upon a transformer-based architecture, a model that has revolutionized many areas of natural language processing (NLP) by enabling systems to better understand and generate human language. What makes Whisper unique is its ability to transcribe speech across a wide range of languages with impressive accuracy, handling not just spoken words but also complex linguistic nuances like accents, dialects, and variations in tone.
Think of Whisper like a seasoned interpreter at an international conference. A good interpreter doesn’t just translate words; they capture the speaker’s meaning, tone, and intent, even if the speaker has a strong accent or is using informal, regional expressions. Whisper works similarly. Its advanced architecture enables it to “listen” with a deep understanding of the many subtle layers of speech, much like a human interpreter who adapts to different speakers and languages on the fly.
-
Step 2: Understanding Whisper's Data Adaptability and Audio Processing
Whisper's performance is largely due to its massive training dataset, which includes audio from diverse sources, enabling it to adapt to different environments—whether you're transcribing a noisy conference call or a clean studio recording. The model processes input audio by converting it into spectrograms, which are visual representations of sound. This allows Whisper’s transformer architecture to interpret the patterns within the audio, such as variations in pitch, loudness, and frequency.
Take a look at the spectrogram below:
This spectrogram displays sound intensity across time (x-axis) and frequency (y-axis). The colors represent the energy or amplitude at different frequencies, with blue showing lower intensities and red showing higher ones. Whisper converts audio into this visual form, which allows its transformer model to “see” the structure of the sound and interpret it in a way that leads to highly accurate transcription. By analyzing these patterns, Whisper can effectively capture not only the words spoken but also nuances like intonation and speech patterns that would otherwise be lost.
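To make the idea concrete, here is a minimal NumPy sketch of turning a waveform into a time-frequency grid. This is not Whisper's actual pipeline (Whisper computes log-mel spectrograms internally); it only illustrates the "slice the audio into frames, take a frequency transform of each" idea the figure describes:

```python
import numpy as np

def spectrogram(audio, frame_size=400, hop=160):
    """Compute a magnitude spectrogram: rows are time frames, columns are frequency bins."""
    window = np.hanning(frame_size)
    frames = [audio[i:i + frame_size] * window
              for i in range(0, len(audio) - frame_size, hop)]
    # One FFT per frame; the magnitude gives the energy at each frequency bin
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# A 1-second 440 Hz tone sampled at 16 kHz (the sample rate Whisper uses)
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (number of frames, number of frequency bins)
```

Each row of `spec` is one moment in time, and the brightest column marks the dominant frequency at that moment; this grid of intensities is what the transformer "sees".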
-
Step 3: Whisper's Language Detection and Model Scalability
Whisper also performs language detection, which makes it adaptable to various languages without needing user input on the spoken language. By automatically detecting the language being spoken, Whisper can transcribe multilingual content in a single session. This is particularly beneficial for global applications like translating international meetings or transcribing content in multiple languages for accessibility.
Finally, Whisper is designed to be accessible, offering pre-trained models in different sizes, from tiny to large, allowing developers to choose the model that best suits their computational resources and accuracy needs. Larger models are more accurate but require more processing power, while smaller models are faster and can be deployed in environments with limited resources.
For a list of all supported languages, please visit: https://pypi.org/project/openai-whisper/
Now that you have a solid understanding of how Whisper handles language detection and its scalable model sizes, it’s time to set up your development environment. In the next step, you’ll begin configuring the necessary tools and libraries to start working with Whisper on your local machine. This environment setup will lay the foundation for running Whisper and transcribing audio files in the following steps.
-
Step 4: Setting up the Development Environment
In this step, you’ll set up the tools and dependencies required to work with Whisper. This includes installing Whisper and the necessary libraries, as well as configuring your project directory.
Proper setup is crucial to ensure that Whisper runs smoothly and can handle real-time transcription tasks. You’ll also learn how to manage dependencies using virtual environments, which is a best practice when working on Python projects.
To begin, let’s clone the project repository. Open your terminal and run the following command:
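The repository URL for this tutorial isn't shown here, so the command below uses a placeholder; substitute the actual URL for this project:

```shell
git clone <repository-url>
```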
Once the project is cloned, open the project folder in your desired IDE. In this tutorial, Visual Studio Code (VSCode) was used, but feel free to use any IDE you are comfortable with.
Next, ensure that your terminal is set to the project’s root directory. After that, create a Python virtual environment to manage the project’s dependencies.
Run the following command in your terminal:
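A typical command for creating the environment; the environment name `venv` is an assumption here, and any name works as long as you use it consistently:

```shell
python3 -m venv venv
```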
Note: You might wonder why python3 was used instead of just python. This is because on some systems, python may still refer to Python 2, while python3 explicitly calls Python 3, which is what this project needs. Depending on your system setup, python could point to Python 3, but it’s good practice to specify python3 to avoid any ambiguity, especially when working in environments where both versions may be installed.
Now, activate the virtual environment:
On macOS/Linux:
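Assuming the environment was created with the name `venv`:

```shell
source venv/bin/activate
```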
On Windows:
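Assuming the environment was created with the name `venv`:

```shell
venv\Scripts\activate
```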
With the virtual environment activated, install the required dependencies by running the following command:
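The project ships a requirements.txt (described in the directory breakdown below), so the install command is:

```shell
pip install -r requirements.txt
```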
At this point, your environment should be set up and ready to go. Open the project in your IDE, and let’s take a look at the directory structure:
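Reconstructed from the file descriptions below, the layout looks roughly like this (the root folder name is hypothetical and will match whatever the repository is called):

```
whisper-tutorial/
├── sample_audio/
│   ├── audio.wav
│   └── I_have_a_dream.mp3
├── requirements.txt
├── transcribe_1.py
└── transcribe_2.py
```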
Here’s a breakdown of the directory structure:
- sample_audio/: This folder contains sample audio files we’ll use for testing Whisper’s transcription capabilities. The files include audio.wav and I_have_a_dream.mp3, which we’ll work with in later steps.
- requirements.txt: This file lists the dependencies necessary for the project, including Whisper and related libraries.
- transcribe_1.py and transcribe_2.py: These Python scripts will contain the code for transcription. In the upcoming steps, we will modify these scripts to perform real-time transcription using Whisper.
Now that your environment is ready, it’s time to move on to the exciting part—implementing a basic Whisper inference to transcribe audio files. In the next step, we’ll begin coding and running our first transcription using Whisper.
-
Step 5: Implementing Basic Whisper Inference
In this step, you'll write your first transcription code using Whisper. This basic transcription task will help you understand how to load the Whisper model, input an audio file, and print the transcribed text. It’s a foundational step that will pave the way for more advanced usage of Whisper in future steps.
Open the transcribe_1.py file in your IDE.
Locate the comment: # TODO#1 - Implement a basic whisper inference
Below this comment, write the following code:
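The snippet, as described line by line in the walkthrough that follows:

```python
import whisper

# Load the "base" model - a good balance of accuracy and speed
model = whisper.load_model("base")

# Transcribe the sample audio file; the result is a dictionary
# containing the text along with other transcription details
result = model.transcribe("sample_audio/audio.wav")

# Print just the transcribed text
print(result["text"])
```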
Save the file, then run the script by executing the following command in your terminal:
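Using `python3` (as discussed in the environment-setup step), the run command is:

```shell
python3 transcribe_1.py
```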
You should get an output similar to this:
Code Walkthrough
- import whisper: This imports the Whisper library, which provides access to all the transcription tools.
- whisper.load_model("base"): This line loads the "base" version of the Whisper model. Whisper provides models of various sizes, and "base" is a good starting point as it balances accuracy and speed.
- model.transcribe("sample_audio/audio.wav"): Here, we pass in the path to the audio file for transcription. The model processes this file and returns a result containing various pieces of information, including the transcribed text.
- print(result["text"]): This prints the actual transcription to the terminal, which is the main output we're interested in.
Whisper's inference process is straightforward yet powerful. It takes raw audio as input, processes it through a pre-trained model, and outputs the corresponding text. This simplicity makes it easy to integrate speech-to-text functionality into your projects. Keep in mind that Whisper is already trained on a vast dataset, so it’s performing all the heavy lifting behind the scenes, including language detection and audio pattern recognition.
Important:
- Ensure that your audio file is properly located in the sample_audio/ directory, or the script will throw an error.
- If you get an error related to the model, check your internet connection, as Whisper will download the model if it’s not already cached locally.
In this step, you’ve successfully implemented a basic Whisper inference. You’ve learned how to load the Whisper model, transcribe an audio file, and output the transcription. This foundational knowledge will be essential as we move forward to more complex transcription tasks.
-
Step 6: Structuring the Base Transcription Script
In this step, we’ll set up the skeleton of the main transcription script. This base structure will serve as a framework for writing functionalities in the coming steps. Setting up this structure now will make it easier to understand where different pieces of code will go, allowing you to focus on specific tasks as you build the transcription process step-by-step.
Open the transcribe_2.py file in your IDE.
The base code contains TODO comments that will serve as placeholders for the code snippets you'll write in the upcoming steps. Here's how the base structure looks:
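The exact skeleton isn't reproduced here; the following is a plausible reconstruction based on the TODO labels, the try-except-with-pass structure, and the transcribe() call described in the upcoming steps (TODO#2's label is given as loading the model, matching what Step 7 implements):

```python
# TODO#1 - Import Libraries and Check availability of GPU


def transcribe(file_path):
    try:
        # TODO#2 - Load the Whisper model and output the model configuration
        # TODO#3 - Establish Access to Audio File
        # TODO#4 - Perform transcription and store to `result`
        # TODO#5 - Create folder where the transcription will be saved to
        # TODO#6 - Loop through the result segments.
        pass  # placeholder until the TODOs above are filled in
    except Exception as e:
        print(f"An error occurred: {e}")


transcribe("sample_audio/audio.wav")
```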
This base structure sets up the flow of your transcription process. Over the next few steps, you'll fill in each TODO section with the necessary code to complete the transcription functionality.
Save the file and run the script using the following command:
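As with the first script, run it with `python3`:

```shell
python3 transcribe_2.py
```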
At this point you will see no output; this is expected and confirms that the base structure runs without errors.
Optional Tip:
If you're using Visual Studio Code, you can install an extension called Todo Tree. This extension quickly searches your workspace for comment tags like TODO and FIXME, and displays them in a tree view in the activity bar. You can even drag this view into the explorer pane or anywhere else you'd prefer it to be.
This extension is not required to complete the project but can be useful for real-world collaboration, helping you keep track of tasks that still need to be done in a more organized manner.
-
Step 7: Importing Libraries and Checking GPU Availability
In this step, we’ll start filling out the TODO#1 section from the base transcription script. The primary goal is to import the required libraries and check whether a GPU is available to optimize the transcription process.
This step is crucial because using a GPU can significantly speed up Whisper's transcription, especially for real-time applications.
Locate the # TODO#1 - Import Libraries and Check availability of GPU comment.
Below this comment, write the following code:
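A sketch reconstructed from the expected output messages and the code summary at the end of this step; the exact import list and variable name (`device`) are assumptions:

```python
import os

import torch
import whisper

# Pick the device Whisper should run on
if torch.cuda.is_available():
    device = "cuda"
    print("CUDA available, using GPU.")
else:
    device = "cpu"
    print("CUDA not available, using CPU.")
```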
Save the file and run the script using the following command:
Expected Output:
- If a GPU is available: CUDA available, using GPU.
- If a GPU is not available: CUDA not available, using CPU.
Why check for GPU availability? Whisper’s models, especially the larger ones, are computationally intensive. Using a GPU can significantly speed up the transcription process, but it’s important to note that Whisper works perfectly well on CPUs too. While a GPU may offer faster performance, especially for larger models or real-time transcription, many tasks can still be completed efficiently on a CPU. If you’re working on a local machine without a GPU, don’t worry—the transcription will still run smoothly, albeit with slightly longer processing times.
Pitfalls to avoid: If your system doesn’t have a GPU, the script will automatically fall back to the CPU, ensuring that you can still complete the transcription. If you’d like to experience the faster performance of a GPU, you can try cloud-based services like Google Colab, which offer access to free GPUs, or use a local machine with an NVIDIA GPU.
However, even without a GPU, you can successfully complete this tutorial and build powerful speech-to-text applications.
Next, move on to TODO#2, which involves loading the Whisper model and outputting the model configuration. This will verify that the model has been correctly loaded and ensure we’re using the appropriate model for the task.
Look inside the transcribe() function and find the # TODO#2 comment.
Below this comment write the following code:
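A sketch, indented to sit inside the transcribe() function's try block; the "base" model size and the `device=device` argument are assumptions (the code summary only confirms `whisper.load_model()` and `model.dims`):

```python
        # Load the "base" Whisper model onto the detected device
        model = whisper.load_model("base", device=device)
        # Print the model's dimensions to confirm it loaded as expected
        print(model.dims)
```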
Note: Be mindful of your indentation, as you are writing inside a function.
Delete the pass statement in the function:
This line was a placeholder to satisfy the try-except block while the function was incomplete. Now that you're providing actual code with an output, the pass statement is no longer needed and can be safely deleted.
Save the file and run the script using the following command:
You should see an output similar to this:
Why output the model configuration? This is a great way to verify that the Whisper model has been loaded correctly. It also gives insight into the model’s dimensions, which provides useful information about how the model processes audio. These dimensions include aspects like the number of mel frequency bins (n_mels) and the vocabulary size (n_vocab), which are fundamental to how the model interprets sound. In future projects you can skip this output to keep the console uncluttered.
In this step, you successfully imported the necessary libraries, checked for GPU availability, and loaded the Whisper model, verifying its configuration. These steps are essential in ensuring that your environment is properly set up to handle transcription tasks efficiently, especially when using more computationally intensive models.
Code Summary
- cuda.is_available(): Checks if a GPU is available for faster processing.
- whisper.load_model(): Loads the Whisper model for transcription.
- model.dims: Outputs the model’s configuration and dimensions, providing insight into its structure.
-
Step 8: Accessing the Audio File and Performing Transcription
In this step, the focus is on accessing the audio file that will be transcribed and performing the actual transcription using Whisper. By extracting the file’s title and path, you’ll ensure the transcription process runs smoothly and the results are correctly stored.
Locate the comment: # TODO#3 - Establish Access to Audio File
Below this comment write the following code:
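Inside the script, these lines sit within transcribe() (indented), and `file_path` is the function's parameter. A standalone sketch you can run directly, with the variable names (`title`, `folder`) taken from the explanations below:

```python
import os

file_path = "sample_audio/audio.wav"  # in the script, this is transcribe()'s parameter

title = os.path.basename(file_path).split('.')[0]   # file name without its extension
folder = os.path.dirname(file_path)                 # directory that holds the audio file
print(f"Accessing '{title}' from '{folder}/'")
```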
This code will extract the file's name (without the extension) and identify the folder where the file is located.
- os.path.basename(file_path).split('.')[0]: This extracts the file name without the extension, making it easier to use later when saving transcriptions.
- os.path.dirname(file_path): This retrieves the directory where the audio file is stored, which will be important for saving the transcription in the correct location.
Save the file and run the script using the following command:
You should see an output similar to this:
Next, locate the comment: # TODO#4 - Perform transcription and store to `result`
Below this comment write the following code:
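Indented to sit inside the try block, and matching the call and message described below:

```python
        # Transcribe in English with progress output; `result` holds the full
        # text plus per-segment timing information
        result = model.transcribe(file_path, language="en", verbose=True)
        print(f"Transcription successful for {title}.")
```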
This code runs the transcription process and outputs a message indicating the success of the transcription.
- model.transcribe(file_path, language="en", verbose=True): This performs the transcription on the specified audio file. The language="en" argument ensures that English is used, though Whisper can detect and transcribe many languages. The verbose=True argument provides more detailed output, useful for debugging and understanding how the transcription progresses.
- print(f"Transcription successful for {title}."): This outputs a confirmation message once the transcription has completed.
Save the file and run the script using the following command:
You should see an output similar to this:
This step completes the core functionality of extracting the audio file and performing the transcription. By using Whisper’s transcribe() method, the model processes the audio file and generates the transcription. The use of language="en" ensures that the transcription is performed in English, but you can adjust this to other supported languages.
In this step, you have successfully set up the logic to access an audio file and perform transcription using Whisper. This is a key step toward building your transcription tool and allows you to handle different audio files in an organized and efficient manner.
-
Step 9: Structuring the Output and Saving the Transcription
In this step, the focus is on ensuring the transcription results are saved to a folder, with each transcription file containing the start and end timestamps for each segment. This step is crucial because it organizes the transcriptions into a structured format, making it easy to review and reference them later.
Locate the comment: # TODO#5 - Create folder where the transcription will be saved to
Below this comment write the following code:
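A sketch using the folder and variable names confirmed later in this step (`transcriptions`, `transcription_folder`); inside the script, indent these lines to sit within the function:

```python
import os

# Folder inside the project directory that will hold the transcripts;
# exist_ok=True makes the call safe to run repeatedly
transcription_folder = "transcriptions"
os.makedirs(transcription_folder, exist_ok=True)
```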
Save the file and run the script using the following command:
Once the script runs, navigate to the project directory. You should now see a transcriptions folder where the transcription files will be saved.
Locate the comment: # TODO#6 - Loop through the result segments.
Below this comment write the following code:
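A sketch of the segment loop, built from the code summary below (`start.append()`, `end.append()`, `file.write()`). The segment keys (`start`, `end`, `text`) follow Whisper's documented result format; the `result`, `title`, and `transcription_folder` values defined here are stand-ins for illustration only, since in the script they come from the earlier TODOs:

```python
import os

# Stand-in for the dictionary Whisper's model.transcribe() returns
result = {"segments": [
    {"start": 0.0, "end": 2.5, "text": " I have a dream"},
    {"start": 2.5, "end": 5.0, "text": " that one day"},
]}
title = "audio"
transcription_folder = "transcriptions"
os.makedirs(transcription_folder, exist_ok=True)

start, end = [], []
output_path = os.path.join(transcription_folder, f"{title}.txt")
with open(output_path, "w") as file:
    for segment in result["segments"]:
        start.append(segment["start"])   # segment start time in seconds
        end.append(segment["end"])       # segment end time in seconds
        # Write "[start - end] text" so each line is easy to reference later
        file.write(f"[{segment['start']:.2f} - {segment['end']:.2f}]{segment['text']}\n")
print(f"Saved transcription to {output_path}")
```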
Save the file and run the script using the following command:
You should see an output similar to this:
After running the script, open the transcriptions folder and inspect the newly created .txt file that contains the transcribed segments along with their timestamps.
Why create a folder for transcriptions? Organizing transcriptions into a dedicated folder ensures that the results are stored separately from the source audio files, making them easier to manage. The folder structure also helps keep the workspace clean and organized.
Why use timestamps? Adding start and end timestamps for each transcribed segment makes the output more structured and easier to review. This is particularly useful if you're working on large audio files or need to reference specific sections of the transcription in future projects.
Try It Yourself:
You can experiment by transcribing a different file from the sample_audio/ folder, such as I_have_a_dream.mp3. Change the file path in the transcribe() function call at the bottom of the script to something like:
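Assuming the function name from the base script, the call at the bottom might become:

```python
transcribe("sample_audio/I_have_a_dream.mp3")
```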
Then save the file and run the script again with the same command as before.
You should get an output similar to this:
You’ll see how timestamps become especially useful for longer transcriptions.
Code Summary
- os.makedirs(transcription_folder, exist_ok=True): Creates the folder where transcriptions will be stored, if it doesn’t already exist.
- start.append() / end.append(): Stores the start and end times of each transcription segment.
- file.write(): Saves the transcribed text along with timestamps to a .txt file.
In this step, you successfully created a structured folder to store transcriptions and saved each segment of the transcription with precise timestamps. This is an essential step in building a transcription tool that provides organized and accessible outputs for future use.
-
Step 10: Conclusion
In this tutorial, you’ve learned how to work with OpenAI's Whisper model to transcribe audio files into text. The process involved setting up your development environment, importing the necessary libraries, and writing a structured transcription script. You covered several key topics such as:
- Checking GPU availability to optimize performance.
- Loading and using the Whisper model for transcription.
- Extracting file paths and names to organize your transcription workflow.
- Performing transcription with Whisper and storing the results in a structured format.
- Saving transcriptions with timestamps for easy reference and analysis.
Think about how this process could be applied to various real-world scenarios. What kind of applications or projects might benefit from speech-to-text functionality? Could you use this for transcribing meetings, generating captions for videos, or building accessibility tools? How would you approach transcription in different languages or with large datasets?
Think about how you could expand on what you've learned in this tutorial. Could you integrate real-time transcription or add additional functionality to your script, such as translating transcripts into different languages or summarizing the results? The possibilities are endless, and applying these skills to real-world projects will help solidify your understanding.